An expanded phenotype-centric benchmark of variant prioritisation tools

Abstract Identifying the causal variant for diagnosis of genetic diseases is challenging when using next-generation sequencing approaches, and variant prioritization tools can assist in this task. These tools provide in silico predictions of variant pathogenicity; however, they are agnostic to the disease under study. We previously performed a disease-specific benchmark of 24 such tools to assess how they perform in different disease contexts. We found that the tools themselves show large differences in performance, but more importantly that the best tools for variant prioritization depend on the disease phenotypes being considered. Here we expand the assessment to 37 tools and refine it by separating performance for nonsynonymous single nucleotide variants (nsSNVs) and missense variants (i.e., excluding nonsense variants). We found differences in performance for missense variants compared to nsSNVs and recommend three tools that stand out in terms of their performance (BayesDel, CADD, and ClinPred).

KEYWORDS
dbNSFP, disease, human phenotype ontology, phenotype, variant prioritization

Next-generation sequencing is routinely used for clinical diagnosis of genetic diseases; however, filtering and interpreting the tens of thousands (whole exome sequencing) or millions (whole genome sequencing) of variants identified by these approaches remains challenging (Caspar et al., 2018). Variant prioritization tools assist in this task by predicting the likely pathogenicity of variants in silico, thereby enabling ranking and filtering of variants. We previously performed a benchmark study of 24 variant prioritization tools, reported that performance differs depending on the disease phenotype, and recommended the use of five top performing tools (Anderson & Lassmann, 2018). Here we present an update to our benchmark that incorporates additional variant prioritization tools added to the latest version of dbNSFP (Liu et al., 2020), increasing the number of assessed tools to 37. Furthermore, we refined our assessment by considering the performance of tools for nonsynonymous single nucleotide variants (nsSNVs) and missense variants (i.e., excluding nonsense variants) separately. In total, for missense variants we tested 37 tools across 4890 disease phenotypes and for nsSNVs we tested 22 tools across 5723 disease phenotypes.
Performance of the variant prioritization tools was assessed through creation of disease specific benchmark datasets. To create these datasets we (1) used terms for human phenotypic abnormalities from the Human Phenotype Ontology (HPO) resource (Köhler et al., 2014), (2) obtained the genes associated with each HPO term from the disease to gene mapping tool Phenolyzer (Yang, Robinson, & Wang, 2015) and (3) obtained the pathogenic variants residing in these genes from ClinVar (Landrum et al., 2016). For each HPO term, performance of tools was based on how well they could discriminate pathogenic variants from a set of benign variants (Niroula & Vihinen, 2019) based on the area under the precision-recall curve (auPRC) which is suitable for inherently unbalanced data (i.e., the ratio of pathogenic to benign variants is small). We also assessed each tool based on the proportion of ClinVar pathogenic variants contained in the top 25 variants after ranking by predicted pathogenicity (PP25).
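The PP25 measure can be illustrated with a minimal Python sketch that ranks variants by a tool's score and reports the fraction of the 25 top-ranked variants that are labelled pathogenic. This reading of PP25 (pathogenic variants as a fraction of the top 25), the function name `pp_at_k`, the handling of ties (input order) and the toy scores and labels are assumptions made for illustration only; they are not taken from the benchmark code.

```python
def pp_at_k(scores, labels, k=25):
    """Fraction of the k highest-scoring variants that are labelled pathogenic.

    scores: predicted pathogenicity per variant (higher = more pathogenic).
    labels: True for pathogenic, False for benign, in the same order.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    # Rank from most to least pathogenic; ties keep their input order.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    if not top:
        return 0.0
    # If fewer than k variants exist, divide by the number available (assumption).
    return sum(1 for _, is_pathogenic in top if is_pathogenic) / len(top)


# Toy data (made up): 100 variants with strictly decreasing scores;
# variants 0-19 and 30-39 are labelled pathogenic.
scores = [0.99 - 0.01 * i for i in range(100)]
labels = [i < 20 or 30 <= i < 40 for i in range(100)]
toy_pp25 = pp_at_k(scores, labels)
print(toy_pp25)
```

In this toy ranking, 20 of the 25 top-ranked variants are pathogenic, so the sketch reports 0.8.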
We categorized the variant prioritization tools into those that predict pathogenicity based primarily on (1) conservation scores derived from sequence alignments, (2) machine learning classifiers incorporating a diverse set of functional genomic features and (3) (Figure 1a and Table S1). The types of tools that perform well are more mixed when considering the PP25, with three conservation scores (LRT, phastCons100way, and SIFT), two machine learning scores (MutationTaster and Polyphen2-HDIV) and one ensemble score (BayesDel_addAF) being the best performers (Figure S1a and Table S2). For nsSNVs, the top performing tools based on the auPRC included both ensemble scores (BayesDel_addAF and BayesDel_noAF) and four of the machine learning scores (CADD, Eigen, Eigen-PC, and VEST4) (Figure 1b and Table S3). Of note, CADD, Eigen and Eigen-PC were overall weak performers when prioritizing missense variants but were excellent at prioritizing nsSNVs. Again, for PP25, performance is mixed, with three conservation scores (LRT, phastCons30way, and phastCons100way), two machine learning scores (CADD and MutationTaster) and two ensemble scores (BayesDel_addAF and BayesDel_noAF) showing very strong performance (Figure S1b and Table S4).
In summary, we found that the best performing variant prioritization tools differ depending on whether they are being used to prioritize missense variants or nsSNVs. Prioritization of missense variants is a more challenging task than prioritization of nonsense variants, as nonsense variants usually affect protein function due to truncation. Whilst missense variants can also cause loss of protein function, this is rarer (around 20%) than for nonsense variants (Kryukov et al., 2007).
The top performing tool in terms of auPRC for both missense variants and nsSNVs was BayesDel_addAF, with strongest performance seen for prioritization of nsSNVs. We also recommend ClinPred, the second best performer for missense variants as it showed consistent performance across a range of disease phenotypes. Whilst CADD was an overall weak performer for prioritizing missense variants, its overall performance for prioritizing nsSNVs was much improved. Hence, we also recommend CADD as a tool for prioritization of nsSNVs.
When considering performance based on PP25, BayesDel_addAF was again a top performer, consistently ranking ClinVar pathogenic variants within the top 25 ranked variants for both missense variants and nsSNVs across most HPO terms. However, in contrast to auPRC, strong performance was seen for conservation scores for both missense variants (LRT, phastCons100way and SIFT) and nsSNVs (LRT, phastCons30way, and phastCons100way). Similarly to the auPRC, CADD was also a strong performer for nsSNVs but not for missense variants.
Performance of the variant prioritization tools differs, even amongst the top performers, across the four top level HPO terms. Strongest performance for both missense variants and nsSNVs was seen for disease phenotypes associated with Neoplasm (HP:0002664). This is likely due to cancer being a more common disease that is better studied than rare diseases associated with Abnormality of metabolism/homeostasis (HP:0001939), Abnormality of the immune system (HP:0002715) and Abnormality of the nervous system (HP:0000707). This means pathogenic variants related to cancer will be overrepresented when compared to rarer diseases and hence also be overrepresented in training datasets of machine learning and ensemble methods. Furthermore, this points to the importance of developing tools that prioritize variants in a disease aware manner rather than the agnostic approach of the tools assessed here (Masica & Karchin, 2016).
In line with estimates of auPRC from our previous benchmark study (Anderson & Lassmann, 2018), we find that machine learning scores and ensemble scores show far superior performance to conservation scores when prioritizing variants across disease phenotypes. However, we do note that the training datasets used by machine learning and ensemble methods overlap with the variants assessed in this benchmark. This will result in more optimistic auPRC values for these methods in comparison to conservation methods. BayesDel and ClinPred in particular were trained on ClinVar pathogenic variants, and given that our benchmark includes the same variants, this will be contributing to their strong performance. Therefore, we cannot comment on whether the performance generalizes to yet unseen variants. Regardless of this, machine learning and ensemble methods can be expected to be superior to conservation methods, as they can predict the pathogenicity of a variant from data that does not directly relate to conservation. Our benchmark is pragmatic in the sense that we focus on how these tools perform when used "out of the box" for the task of

METHODS
We previously described in detail our automated pipeline to integrate phenotypes with annotated variants (Anderson & Lassmann, 2018). Therefore, we only briefly describe each component and focus on describing updates to the benchmark.

Linking candidate genes to causative variants using dbNSFP annotations
The database for nonsynonymous SNPs' functional predictions (dbNSFP) contains annotation for 84,013,490 potential nsSNVs and splicing-site SNVs in the human genome (Liu et al., 2011; Liu et al., 2020). We used dbNSFP version 4.1a (release 16 June, 2020), which is based on Gencode release 29/Ensembl version 94 (Frankish et al., 2019). We selected all variants occurring in the gene lists returned by Phenolyzer. We restricted our analysis to ClinVar (Landrum et al., 2016) "pathogenic" variants that were associated with a single gene. In total we obtained 35,167 pathogenic variants linked to genes associated with disease phenotypes (File S2). Of these, 16,411 were nonsense variants and 18,756 were missense variants.
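The selection steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the pipeline itself: the `Variant` record, its field names, the string encodings for ClinVar significance and consequence, and the toy gene symbols are all hypothetical and do not correspond to actual dbNSFP column names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    # Hypothetical record layout, not actual dbNSFP columns.
    variant_id: str
    genes: tuple       # gene symbols annotated for the variant
    clinvar: str       # ClinVar clinical significance (assumed encoding)
    consequence: str   # "missense" or "nonsense" (assumed encoding)


def select_pathogenic(variants, candidate_genes):
    """Keep single-gene ClinVar "pathogenic" variants in the candidate gene list."""
    kept = []
    for v in variants:
        if len(v.genes) != 1:                 # drop variants linked to several genes
            continue
        if v.genes[0] not in candidate_genes:  # gene list for one phenotype term
            continue
        if v.clinvar != "pathogenic":
            continue
        kept.append(v)
    return kept


def split_by_consequence(variants):
    """Return (missense, nonsense) lists; their union is the nsSNV set."""
    missense = [v for v in variants if v.consequence == "missense"]
    nonsense = [v for v in variants if v.consequence == "nonsense"]
    return missense, nonsense


# Toy data with made-up identifiers and gene symbols.
variants = [
    Variant("v1", ("GENE_A",), "pathogenic", "missense"),
    Variant("v2", ("GENE_A",), "pathogenic", "nonsense"),
    Variant("v3", ("GENE_A", "GENE_B"), "pathogenic", "missense"),  # two genes
    Variant("v4", ("GENE_C",), "benign", "missense"),               # not pathogenic
    Variant("v5", ("GENE_X",), "pathogenic", "nonsense"),           # gene not listed
]
candidate_genes = {"GENE_A", "GENE_C"}

selected = select_pathogenic(variants, candidate_genes)
missense, nonsense = split_by_consequence(selected)
nssnvs = missense + nonsense
print([v.variant_id for v in missense], [v.variant_id for v in nonsense])
```

Keeping the missense subset separate from the combined nsSNV set mirrors the two evaluation settings used in the benchmark.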

Benign variants
We used a set of 63,197 common (allele frequency ≥1% and <25%) missense variants obtained from the Exome Aggregation Consortium (ExAC) database (Niroula & Vihinen, 2019) and annotated with dbNSFP. We removed variants with ClinVar annotation other than "benign" and variants associated with more than one gene. We further filtered the variants to those 29,173 that had scores across all variant prioritization tools and used these in the benchmark analysis (File S3).
We assessed the following 22 variant prioritization tools that score nsSNVs: BayesDel (with and without allele frequency) (Feng, 2017) (Sim et al., 2012) and SIFT4G (Vaser et al., 2016). Further detail on the aforementioned tools is available in Table S1 of the dbNSFP v4 publication (Liu et al., 2020). We used the dbNSFP converted rank scores for each tool. We did not assess LINSIGHT (Huang et al., 2017) as this tool is focussed on prioritization of noncoding variants. We also omitted M-CAP (Jagadeesh et al., 2016), MutPred (Pejaver et al., 2020) and MVP (Qi et al., 2021) as these tools were missing scores for a substantial proportion of the benign variants.
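The benign variant filters can be illustrated with the following Python sketch. The dictionary keys, tool names and toy records are hypothetical, and the treatment of variants without any ClinVar annotation (kept here) is an assumption, since the text only states that variants with an annotation other than "benign" were removed.

```python
def filter_benign(records, tools, min_af=0.01, max_af=0.25):
    """Apply the benign-set filters to a list of variant records (dicts).

    Each record is assumed to have the hypothetical keys:
      "af"      allele frequency as a fraction
      "clinvar" ClinVar annotation, or None if the variant has none
      "genes"   list of gene symbols
      "scores"  dict mapping tool name -> rank score (None if missing)
    """
    kept = []
    for rec in records:
        # Common variants only: allele frequency >= 1% and < 25%.
        if not (min_af <= rec["af"] < max_af):
            continue
        # Remove variants whose ClinVar annotation is anything other than "benign".
        # Unannotated variants are kept (assumption).
        if rec["clinvar"] not in (None, "benign"):
            continue
        # Remove variants associated with more than one gene.
        if len(rec["genes"]) != 1:
            continue
        # Complete cases only: every assessed tool must provide a score.
        if any(rec["scores"].get(tool) is None for tool in tools):
            continue
        kept.append(rec)
    return kept


# Toy records with made-up values.
tools = ["tool_a", "tool_b"]
full = {"tool_a": 0.2, "tool_b": 0.3}
records = [
    {"id": "v1", "af": 0.05, "clinvar": None, "genes": ["G1"], "scores": full},
    {"id": "v2", "af": 0.30, "clinvar": None, "genes": ["G1"], "scores": full},
    {"id": "v3", "af": 0.02, "clinvar": "pathogenic", "genes": ["G2"], "scores": full},
    {"id": "v4", "af": 0.02, "clinvar": "benign", "genes": ["G1", "G2"], "scores": full},
    {"id": "v5", "af": 0.10, "clinvar": "benign", "genes": ["G3"],
     "scores": {"tool_a": 0.3, "tool_b": None}},
    {"id": "v6", "af": 0.01, "clinvar": "benign", "genes": ["G4"], "scores": full},
]
benign_ids = [rec["id"] for rec in filter_benign(records, tools)]
print(benign_ids)
```

Requiring a score from every tool means all tools are compared on an identical benign set, at the cost of discarding variants that any single tool fails to score.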

Performance evaluation
We used the R package PRROC (Keilwagen et al., 2014) to calculate the area under the precision recall curve (auPRC) based on the interpolation of Davis and Goadrich (Davis & Goadrich, 2006). [...] were produced using the R NMF package (Gaujoux & Seoighe, 2010).
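To give an intuition for the area under the precision-recall curve, the following Python sketch computes step-wise average precision for one tool and one phenotype term. This is a simpler estimator than the Davis and Goadrich interpolation used by PRROC and will generally give slightly different values; it is shown only as an illustration. The function name `average_precision`, the tie handling (input order) and the toy data are assumptions.

```python
def average_precision(scores, labels):
    """Step-wise average precision, a simple estimate of the area under the PR curve.

    scores: predicted pathogenicity per variant (higher = more pathogenic).
    labels: True for pathogenic, False for benign, in the same order.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_pos = sum(1 for y in labels if y)
    if n_pos == 0:
        raise ValueError("at least one pathogenic variant is required")
    # Rank from most to least pathogenic; ties keep their input order.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    total = 0.0
    for rank, (_, is_pathogenic) in enumerate(ranked, start=1):
        if is_pathogenic:
            true_pos += 1
            # Precision at each rank where a pathogenic variant is recovered.
            total += true_pos / rank
    return total / n_pos


# Toy example: pathogenic variants at ranks 1 and 3 of 4.
toy_ap = average_precision([0.9, 0.8, 0.7, 0.1], [True, False, True, False])
# A perfect ranking places all pathogenic variants first.
perfect_ap = average_precision([0.9, 0.8, 0.2, 0.1], [True, True, False, False])
print(toy_ap, perfect_ap)
```

For a tool that ranks variants at random, the expected area is approximately the fraction of pathogenic variants in the dataset, which is why the measure remains informative when pathogenic variants are heavily outnumbered by benign ones.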

ACKNOWLEDGMENTS
This work was supported by the McCusker Charitable Foundation and the Stan Perron Foundation. Timo Lassmann is supported by a fellowship from the Feilman Foundation. Open access publishing

CONFLICTS OF INTEREST
The authors declare no conflicts of interest.

AUTHOR CONTRIBUTIONS
Denise Anderson: performed analysis, interpreted results and drafted the manuscript. Timo Lassmann: conceived the study, interpreted results and drafted the manuscript.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available in Files S1, S2, and S3. Code used to generate results for this study is available as File S4.